Important packages:
setwd("Dataset")
Warning: The working directory was changed to C:/Users/DELL/Documents/GitHub/Project_IT326/Dataset inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
Dataset <- read.csv("data.csv")
Preprocessed_dataset <- read.csv("preprocessed_dataset.csv")
if(!require(ggplot2)){
install.packages("ggplot2")}
library(ggplot2)
if(!require(dplyr)){
install.packages("dplyr")}
library(dplyr)
if(!require(ltm)){
install.packages("ltm")}
library(ltm)
if (!require(cluster, quietly = TRUE)) {
install.packages("cluster")
}
if (!require(factoextra, quietly = TRUE)) {
install.packages("factoextra")
}
if (!require(caret, quietly = TRUE)) {
install.packages("caret")
}
if (!require(party, quietly = TRUE)) {
install.packages("party")
}
if (!require(partykit, quietly = TRUE)) {
install.packages("partykit")
}
if (!require(rpart, quietly = TRUE)) {
install.packages("rpart")
}
if (!require(rpart.plot, quietly = TRUE)) {
install.packages("rpart.plot")
}
if (!require(tidyverse, quietly = TRUE)) {
install.packages("tidyverse")
}
if (!require(RWeka, quietly = TRUE)) {
install.packages("RWeka")
}
_________________________________________________________________________________________________________
Our primary objective of this analysis is to classify whatever a student will go to college or not we aim to discover the most influential variables and their relationships with the decision not to attend college to provide counseling session to the student to help them
using the classification methods and to identify the main factors and reasons why students are less likely to pursue higher education using class label indicated “will_go_to_college” being ‘False’. By leveraging the provided dataset with attributes such as school type, school accreditation, gender, interest in college, residence and use clustering to group pattern of the student who will go to collage or not
Kaggle.com
| Attribute | Description | Type | Possible values |
| type_school | The type of school the student attends | Binary | Academic / Vocational |
| school_accreditation | The quality if school | Binary | A / B (A is better than B) |
| gender | The student’s gender | Binary | Male / Female |
| interest | The student’s interest in going to college | Nominal | Very interested /Interested / Less Interested / Not Interested /Uncertain |
| residence | The student’s residence | Binary | Urban / Rural |
| parent_age | The age of the student’s parents | Numeric | 40 - 65 |
| parent_salary | The monthly salary of the student’s parents in IDR/Rupiah. [1Rupiah = 0.00024SAR] |
Numeric | 1000K - 10M |
| house_area | The size of the student’s house in meter square | Numeric | 20 - 120 |
| average_grades | The student’s average grades in school. | Numeric | 75 - 98 |
| parent_was_in_college | Whether the student’s parents attended college. | Binary | True - False |
| will_go_to_college | Whether the student plans to go to college. | Binary | True - False |
head(Dataset)
Here are a sample of 6 row from our dataset.
sum(is.na(Dataset))
[1] 0
There are no missing values in our dataset.
Graph 1:
df =data.frame(Dataset)
ggplot(data=df, aes(x = interest, fill = will_go_to_college)) +
geom_bar() +
scale_x_discrete(limits = c('Not Interested', 'Less Interested', 'Uncertain', 'Interested', 'Very Interested')) +
labs(title = 'College interest vs College attendance ') +
scale_fill_manual(values = c("True" = "antiquewhite2", "False" = "antiquewhite3")) +
theme_minimal()
NA
NA
According to the graph, whether students are interested in going to college or not does not affect whether they actually end up attending Collage . There is a group of individuals who were interested in attending but did not receive acceptance, while others who were not interested were accepted.
Graph 2:
filtered_True =filter(Dataset, will_go_to_college == 'True')
filtered_False =subset(Dataset, will_go_to_college =='False')
ggplot() +
geom_density(data = filtered_True, aes(x = average_grades, fill = "Going to College"), alpha = 0.5) +
geom_density(data = filtered_False, aes(x = average_grades, fill = "NOT Going to College"), alpha = 0.5) +
labs(x = "Average Grades", y = "Density") +
ggtitle("Comparison of Average Grades for Students Going to College and NOT Going to College") +
scale_fill_manual(values = c("Going to College" = "antiquewhite4", "NOT Going to College" = "antiquewhite1"))
The graph shows that the average grades for students who had accepted to go to college were higher than those who did not enter college , and this indicates the existence of a correlation between those who going to college and the average grades
Graph 3:
Dataset_percentage <- Dataset %>%
group_by(type_school) %>%
summarise(percentage = mean(will_go_to_college == "True") * 100)
# Create a percentage chart
ggplot(Dataset_percentage, aes(x = type_school, y = percentage, fill = type_school)) +
geom_bar(stat = "identity") +
labs(title = "Percentage of Students Going to College by Type of School",
x = "Type of School",
y = "Percentage") +
scale_fill_manual(values = c("Academic" = "antiquewhite3", "Vocational" = "antiquewhite2")) +
theme_minimal()
NA
NA
This graph shows the impact of the type of high school attended by students on their college attendance . Based on the bar chart:
among students from Academic high schools, 313 are going to college, and 296 are not.
among students from Vocational high schools, 187 are going to college, and 204 are not
These information tell us that a higher proportion of students from academic high schools are going to college compared to those from vocational schools which suggest that the type of school attended.
Graph 4:
ggplot(Dataset, aes(x = average_grades)) +
geom_histogram(binwidth = 5, fill = "antiquewhite2", color = "antiquewhite4") +
labs(title = "Distribution of students' grades",
x = "Students' average grades",
y = "Frequency") +
theme_minimal()
This histogram show us that the majority of the students in the dataset are performing well since it seems like their grades are spanning between 75 and 98. This analysis will help us determine whether the academic performance level of students is a contributing factor to their college attendance or not.
summary(Dataset$average_grades)
The student grades in our dataset range from 75.00 to 98.00, with a median of 85.58 and an average of 86.10. This suggests that most students are doing well as none of them have average grades below 50. However, it’s interesting to note that some students have much higher or lower grades than the average, mainly due to the wide range of grades..
summary(Dataset$parent_salary)
In the dataset, we’ve got students parents with salaries ranging from 1,000,000 to 10,000,000 IDR/Rupiah. The median salary is 5,440,000 IDR/Rupiah, and the average is 5,381,570 IDR/Rupiah. This data tells us that many parents in our dataset earn less than the average salary in Indonesia, which is 146,000,000 IDR. This suggests that quite a few students in our dataset come from families with limited finances. And this financial situation could certainly impact their ability to get through college.
Additionally we can utilize the house area attribute to gain a deeper understanding of the socioeconomic status of students’ families, where students with houses significantly larger than the mean might indicate a higher socioeconomic status, while those with houses considerably smaller than the mean might reflect a comparatively lower socioeconomic status. Based on the shown output, the house areas range from [20.00-120.00 ㎡]. The median house area is 75.50 ㎡ indicates that families with house areas around this value likely have moderate socioeconomic status with houses that neither very small nor very large.
summary(Dataset$parent_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
40.00 50.00 52.00 52.21 54.00 65.00
SD=sd(Dataset$parent_age)
MeanAge=mean(Dataset$parent_age)
cat("coefficient of variation:",SD/MeanAge*100,"%")
coefficient of variation: 6.704771 %
This summary provides the range for age attribute [40,65] which indicates that all parent in middle age during this age parent have more concern about their children , the coefficient of variation= 6.7% which indicates lower variation ,and the value of attribute parent_agerare are relatively close to the mean overall 25% of them have an age below or equal to 50 , 75% have an age below or equal to 54 and the median value is 52
we aim to increase the quality of data and transforming the data to make it suitable for analysis
###parent age outliers
quartiles <- quantile(Dataset$parent_age, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_age)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
data_no_outlier <- subset(Dataset, Dataset$parent_age > Lower & Dataset$parent_age < Upper)
dim(data_no_outlier)
[1] 957 11
###parent salary outliers
quartiles <- quantile(Dataset$parent_salary, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_salary)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
data_no_outlier <- subset(data_no_outlier, data_no_outlier$parent_salary> Lower & data_no_outlier$parent_salary < Upper)
dim(data_no_outlier)
[1] 955 11
###averge grades outliers
quartiles <- quantile(Dataset$average_grades, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$average_grades)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
data_no_outlier <- subset(data_no_outlier, data_no_outlier$average_grades> Lower & data_no_outlier$average_grades < Upper)
dim(data_no_outlier)
[1] 944 11
###house area outliers
quartiles <- quantile(Dataset$house_area, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$house_area)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
data_no_outlier <- subset(data_no_outlier, data_no_outlier$house_area> Lower & data_no_outlier$house_area < Upper)
Founded_Outliers=data.frame(anti_join(Dataset,data_no_outlier))
Joining with `by = join_by(type_school, school_accreditation, gender, interest, residence, parent_age, parent_salary, house_area, average_grades, parent_was_in_college, will_go_to_college)`
print(Founded_Outliers)
After conducting data analysis and identifying outliers, our inspection reveals that the detected outliers represent inherent variation within the population. Regarding Parent_age, outliers are observed for values below 44 and above 65. However, it should be noted the age from 40 to 65 fall within the expected mean of our dataset meaning that it doesn’t indicate that they are outliers . For parent_salary, we found two outliers: one below 1,326,250 ind ≈ 85 USD and another above 9,416,250 ind ≈ 606 USD. The minimum and maximum values were determined to be 1,000,000 ind ≈ 64 USD and 10,000,000 ind ≈ 644 USD, respectively. In the case of grades, twelve outliers were identified, ranging from below 76 to above 97. Nevertheless, since the data falls within the acceptable range of 0 to 100, these outliers should be retained as they are still considered normal and within the usual grade range. Finally, for house_area, we found eleven outliers below 34.4m and above 115m, with the minimum being 20m and the maximum being 120m. However, these values are still considered typical for the population.
normalize <- function(x) {return((x-min(x))/ (max(x)-min(x)))}
datasetWithoutNormalization<-Dataset
Dataset$parent_salary<-normalize(datasetWithoutNormalization$parent_salary)
Dataset$house_area<-normalize(datasetWithoutNormalization$house_area)
print(Dataset)
We applied normalization to the ‘parent_salary’ and ‘house_area’ attributes, scaling their values to a range between 0 and 1. This normalization process greatly facilitates data handling and analysis, ensuring that these attributes are on a consistent scale. Which will improve the reliability of our data analysis and enable better conclusions to be drawn from the dataset. Normalization is a crucial step in preparing the data for modeling, as it prevents attributes with larger numerical ranges from dominating the analysis and ensures fair treatment for all features.
Dataset$average_grades [Dataset$average_grades >= 95] <- '+A'
Dataset$average_grades [95 >Dataset$average_grades & Dataset$average_grades >= 90] <- 'A'
Dataset$average_grades [90 >Dataset$average_grades & Dataset$average_grades >= 85] <- '+B'
Dataset$average_grades [85 >Dataset$average_grades & Dataset$average_grades >= 80] <- 'B'
Dataset$average_grades [80 >Dataset$average_grades & Dataset$average_grades >= 75] <- '+C'
Dataset$average_grades [75 >Dataset$average_grades & Dataset$average_grades >= 70] <- 'C'
Dataset$average_grades [70 >Dataset$average_grades & Dataset$average_grades >= 65] <- '+D'
Dataset$average_grades [65 >Dataset$average_grades & Dataset$average_grades >= 60] <- 'D'
Dataset$average_grades [60 >Dataset$average_grades & Dataset$average_grades >= 0] <- 'F'
Dataset$average_grades <- as.character(Dataset$average_grades )
print(Dataset)
We transformed the parent_age attribute into intervals by dividing the values to be fall on one of two possible interval labels with equal width which is(40,50],(50,60] by discretization the values well be simpler to classify or perform other methods that can help us later in our model.
and to better utilize and interpret the grades attributes for each student, we have converted the numeric grades into letter grades (A+, A, B+, B, C+, C, D+, D, F). This transformation was undertaken to focus on the general letter grade representation rather than the precise numerical values.
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="TRUE"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="True"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="FALSE"]<-0
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="False"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="TRUE"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="True"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="FALSE"]<-1
Dataset$will_go_to_college[Dataset$will_go_to_college=="False"]<-1
Dataset$gender[Dataset$gender=="Female"]<-1
Dataset$gender[Dataset$gender=="Male"]<-0
Dataset$school_accreditation[Dataset$school_accreditation=="A"]<-1
Dataset$school_accreditation[Dataset$school_accreditation=="B"]<-0
Dataset$interest[Dataset$interest=="Very Interested"]<-4
Dataset$interest[Dataset$interest=="Interested"]<-3
Dataset$interest[Dataset$interest=="Less Interested"]<-2
Dataset$interest[Dataset$interest=="Not Interested"]<-1
Dataset$interest[Dataset$interest=="Uncertain"]<-0
Dataset$type_school[Dataset$type_school=="Academic"]<-1
Dataset$type_school[Dataset$type_school=="Vocational"]<-0
Dataset$residence[Dataset$residence=="Urban"]<-1
Dataset$residence[Dataset$residence=="Rural"]<-0
print(Dataset)
Since encoding is an important step in data preprocessing that enables the use of categorical data in various data analysis and machine learning tasks, we encoded attributes like the ‘parent was in college’ attribute from (True, False) to (1, 0), and ‘will go to college’ from (True, False) to (0, 1). This encoding is carried out as we aim to predict the influencing factors. Additionally, we encoded the ‘gender’ attribute from (Female, Male) to (1, 0), ‘school accreditation’ from (A, B) to (1, 0), ‘type_school’ from (Academic, Vocational) to (1, 0), ‘residence’ from (Urban, Rural) to (1, 0), and ‘interest’ from (Very interested ,Interested , Less Interested , Not Interested ,Uncertain ) to (4,3,2, 1, 0) respectively. Encoding serves to simplify the data, reduce complexity, and enhance its suitability for modeling purposes.
#1
C=chisq.test(Dataset$type_school , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$type_school and Dataset$will_go_to_college
X-squared = 1.0751, df = 1, p-value = 0.2998
#2
C=chisq.test(Dataset$school_accreditation , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$school_accreditation and Dataset$will_go_to_college
X-squared = 0.78513, df = 1, p-value = 0.3756
#3
C=chisq.test(Dataset$gender , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$gender and Dataset$will_go_to_college
X-squared = 1.0249, df = 1, p-value = 0.3114
#4
C=chisq.test(Dataset$interest , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test
data: Dataset$interest and Dataset$will_go_to_college
X-squared = 73.337, df = 4, p-value = 4.477e-15
#5
C=chisq.test(Dataset$residence , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$residence and Dataset$will_go_to_college
X-squared = 0.016098, df = 1, p-value = 0.899
#6
C=chisq.test(Dataset$average_grades , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test
data: Dataset$average_grades and Dataset$will_go_to_college
X-squared = 261.89, df = 4, p-value < 2.2e-16
#7
C=chisq.test(Dataset$parent_was_in_college , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$parent_was_in_college and Dataset$will_go_to_college
X-squared = 2.1194, df = 1, p-value = 0.1454
All the attributes have X-square greater than the p-value which indicate a some association with the class label; therefore we reject the null hypothesis
we noticed for ‘interest’ and ‘average grade’ the analysis shows that X-square is much larger than p-value indicate the significant association of the two attributes with the decision of the student to go to the collage or not
biserial.cor(Dataset$parent_salary,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
[1] 0.4756928
biserial.cor(Dataset$house_area,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
[1] 0.4672669
biserial.cor(Dataset$parent_age,Dataset$will_go_to_college,c("all.obs", "complete.obs"), level = 1)
[1] 0.04287336
the analysis shows moderate correlation coefficient for parent salary and house area with the class label which indicate that they are relevant factors meaning that the higher the parent salary and the larger house area the higher probability for a student to enroll in a collage
where is the on other hand, the correlation coefficient for the parent age is very small which indicate that the parent age has little impact to the probability for student to enroll in a collage
ultimately based on the analysis of the correlation that we conducted on the relationship of the dataset attributes with the class label, and the understanding of the data and the context of each attribute and potential relevance to the class label we decided to not delete any of the attribute
The attribute selection measure is a heuristic technique employed to select the most optimal criteria for dividing a collection of data tuples into partitions that are as pure as possible. Several attribute selection measures are commonly utilized, such as Information Gain (ID3), Gain Ratio (C4.5), and Gini Index (CART). we will constraints for Each of these measures different partitions by dividing them in ratios of 50%-50%, 70%-30%, and 80%-20%.
data<-Preprocessed_dataset
data$will_go_to_college <- as.factor(data$will_go_to_college)
data$residence <- as.factor(data$residence)
data$gender <- as.factor(data$gender)
data$parent_was_in_college <- as.factor(data$parent_was_in_college)
data$interest <- as.factor(data$interest)
data$type_school <- as.factor(data$type_school)
data$school_accreditation <- as.factor(data$school_accreditation)
data$average_grades <- as.factor(data$average_grades)
library(tidyverse)
library(caret)
data$will_go_to_college<- as.numeric(data$will_go_to_college)
hist(data$will_go_to_college,col="coral")
prop.table(table(data$will_go_to_college))
1 2
0.5 0.5
we want to confirm that the distribution between the two label data is not too much different. Because imbalanced datasets can lead to imbalanced accuracy.
Fortunately ,our data is balanced
| 70% training, 30% testing |
|---|
set.seed(123)
ind=sample(2, nrow(data), replace=TRUE, prob=c(0.70 , 0.30))
train_data=data[ind==1,]
test_data=data[ind==2,]
dim(train_data)
[1] 705 11
dim(test_data)
[1] 295 11
We partition the data the data into (70% training, 30% testing). This result in 705 object in the training set and 295 object in the testing set.
library(party)
myFormula<- will_go_to_college~ + type_school+school_accreditation+gender+interest+residence+parent_age+parent_salary+house_area+average_grades+parent_was_in_college
dataset_ctree<-ctree(myFormula, data=train_data)
table(predict(dataset_ctree), train_data$will_go_to_college)
0 1
0 309 48
1 37 311
plot(dataset_ctree,type="simple")
testPred <- predict(dataset_ctree, newdata = test_data)
result<-table(testPred, test_data$will_go_to_college )
library(caret)
co_result <- confusionMatrix(result,positive="1")
print(co_result)
Confusion Matrix and Statistics
testPred 0 1
0 136 34
1 18 107
Accuracy : 0.8237
95% CI : (0.7754, 0.8655)
No Information Rate : 0.522
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6451
Mcnemar's Test P-Value : 0.03751
Sensitivity : 0.7589
Specificity : 0.8831
Pos Pred Value : 0.8560
Neg Pred Value : 0.8000
Prevalence : 0.4780
Detection Rate : 0.3627
Detection Prevalence : 0.4237
Balanced Accuracy : 0.8210
'Positive' Class : 1
this matrix below show the evaluation model on testing data
| Evaluation method | Method |
|---|---|
| Accuracy | 82.37% |
| Error rate | 17.63% |
| Sensitivity | 75.89% |
| Specificity | 88.31% |
| Precision | 85.60% |
In this decision tree model, the average grade with the highest information gain is selected as the root attribute. This means that the average grade is considered the most informative feature for classifying the data.
To enhance the model’s predictive capability, additional splitting criteria are incorporated based on their information gain values. These criteria include type_school, interest, residence, parent_salary, and house_area. By considering these attributes, the model aims to capture more relevant information from the data and improve its classification accuracy.
The model then utilizes the house area and average grades as splitting criteria to separate the data into different branches of the tree in multiples levels
However, it’s worth noting that the decision tree in this case is considered to have a low level. This implies that the tree is relatively shallow and may not be able to capture intricate patterns or relationships present in the data. For example gender, parent_was_in_college, parent age, and school accreditation are not included in the tree due to limited availability of the data or their lower information gain values.
library(RWeka)
C45Fit <- J48( will_go_to_college~.,data=train_data)
table(train_data$will_go_to_college, predict(C45Fit))
0 1
0 322 24
1 27 332
plot(C45Fit,type = "simple")
testPred <- predict(C45Fit, newdata = test_data)
results <- confusionMatrix(testPred, test_data$will_go_to_college, positive = "1")
print(results)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 128 32
1 26 109
Accuracy : 0.8034
95% CI : (0.7534, 0.8472)
No Information Rate : 0.522
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6053
Mcnemar's Test P-Value : 0.5115
Sensitivity : 0.7730
Specificity : 0.8312
Pos Pred Value : 0.8074
Neg Pred Value : 0.8000
Prevalence : 0.4780
Detection Rate : 0.3695
Detection Prevalence : 0.4576
Balanced Accuracy : 0.8021
'Positive' Class : 1
this matrix below show the evaluation model on testing data
| Evaluation method | Method |
|---|---|
| Accuracy | 80.34% |
| Error rate | 19.66% |
| Sensitivity | 77.30% |
| Specificity | 83.12% |
| Precision | 80.74% |
In contrast to the previous decision tree, the current tree is more complex and includes all the attributes except for gender. This means that the decision tree takes into account a wider range of factors to make its classifications. The attribute with the highest Gain ratio, which measures the effectiveness of a split, is chosen as the root attribute. In this case, Parent salary is determined to have the highest Gain ratio and is selected as the root attribute.
The model then repeatedly uses average grades as the splitting criterion at different levels of the tree. This indicates that the average grades are considered crucial for classifying the data and further subdividing the branches.
library(rpart)
library(rpart.plot)
cart_fit=rpart(will_go_to_college~., data=train_data, method="class",cp=0.008)
rpart.plot(cart_fit)
testPred<- predict(cart_fit,newdata=test_data,type="class")
results<- confusionMatrix(testPred,test_data$will_go_to_college,positive="1")
print(results)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 125 31
1 29 110
Accuracy : 0.7966
95% CI : (0.7461, 0.8411)
No Information Rate : 0.522
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5922
Mcnemar's Test P-Value : 0.8973
Sensitivity : 0.7801
Specificity : 0.8117
Pos Pred Value : 0.7914
Neg Pred Value : 0.8013
Prevalence : 0.4780
Detection Rate : 0.3729
Detection Prevalence : 0.4712
Balanced Accuracy : 0.7959
'Positive' Class : 1
this matrix below show the evaluation model on testing data
| Evaluation method | Method |
|---|---|
| Accuracy | 79.66% |
| Error rate | 20.34% |
| Sensitivity | 78.01% |
| Specificity | 81.17% |
| Precision | 79.14% |
In this particular partition of the decision tree, Parent salary was once again chosen as the root attribute due to its high Gain index.
Gender and residence, on the other hand, were not included in this partition of the tree. Without sufficient data or a significant information gain associated with these attributes, the model may not have enough evidence to utilize them for accurate classification.
It’s worth noting that gender was not included in this partition, and it is important to highlight that the accuracy of this specific model was relatively lower, at 79.66%.
| 80% training, 20% testing |
|---|
set.seed(123)
ind=sample(2, nrow(data), replace=TRUE, prob=c(0.80 , 0.20))
train_data=data[ind==1,]
test_data=data[ind==2,]
dim(train_data)
[1] 802 11
dim(test_data)
[1] 198 11
We partition the data the data into (80% training, 20% testing). This result in 802 object in the training set and 198 object in the testing set.
library(party)
myFormula<- will_go_to_college~ + type_school+school_accreditation+gender+interest+residence+parent_age+parent_salary+house_area+average_grades+parent_was_in_college
dataset_ctree<-ctree(myFormula, data=train_data)
table(predict(dataset_ctree), train_data$will_go_to_college)
0 1
0 344 41
1 56 361
plot(dataset_ctree,type="simple")
testPred <- predict(dataset_ctree, newdata = test_data)
result<-table(testPred, test_data$will_go_to_college )
library(caret)
co_result <- confusionMatrix(result,positive="1")
print(co_result)
Confusion Matrix and Statistics
testPred 0 1
0 81 13
1 19 85
Accuracy : 0.8384
95% CI : (0.7796, 0.8868)
No Information Rate : 0.5051
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6769
Mcnemar's Test P-Value : 0.3768
Sensitivity : 0.8673
Specificity : 0.8100
Pos Pred Value : 0.8173
Neg Pred Value : 0.8617
Prevalence : 0.4949
Detection Rate : 0.4293
Detection Prevalence : 0.5253
Balanced Accuracy : 0.8387
'Positive' Class : 1
this matrix below show the evaluation model on testing data
| Evaluation method | Method |
|---|---|
| Accuracy | 83.84 % |
| Error rate | 16.16% |
| Sensitivity | 86.73% |
| Specificity | 81.00% |
| Precision | 81.73% |
In this partition of the decision tree, the average grade with the highest information gain is selected as the root attribute. The selection is based on the fact that the average grade attribute has the lowest conditional entropy, the model aims to create a more effective split and improve the classification accuracy. also to mention that attribute was not included in the tree gender,parent was in collage, parent age
On a positive note, the accuracy in this partition has increased by 1.47% compared to the previous partition. This improvement in accuracy can be attributed to the fact that the training data has been increased by 10%. By having more data for training, the model has a larger and more diverse set of examples to learn from, allowing it to make more accurate predictions.
library(RWeka)
C45Fit <- J48( will_go_to_college~.,data=train_data)
table(train_data$will_go_to_college, predict(C45Fit))
0 1
0 378 22
1 16 386
plot(C45Fit,type = "simple")
testPred <- predict(C45Fit, newdata = test_data)
results <- confusionMatrix(testPred, test_data$will_go_to_college, positive = "1")
print(results)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 76 15
1 24 83
Accuracy : 0.803
95% CI : (0.7407, 0.856)
No Information Rate : 0.5051
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6064
Mcnemar's Test P-Value : 0.2002
Sensitivity : 0.8469
Specificity : 0.7600
Pos Pred Value : 0.7757
Neg Pred Value : 0.8352
Prevalence : 0.4949
Detection Rate : 0.4192
Detection Prevalence : 0.5404
Balanced Accuracy : 0.8035
'Positive' Class : 1
this matrix below show the evaluation model on testing data
| Evaluation method | Method |
|---|---|
| Accuracy | 80.03% |
| Error rate | 19.97% |
| Sensitivity | 84.69% |
| Specificity | 76.00% |
| Precision | 77.57% |
In this particular model that utilizes the gain ratio, it is observed that the decision tree is once again more complex compared to the other models. The attribute chosen as the root attribute is parent salary, based on its high purity and resulting high gain ratio. However, it is important to note that the gender attribute was not included in the decision tree. Ultimately, the gain ratio model’s construction involved careful considerations and prioritized the parent’s salary as root
library(rpart)
library(rpart.plot)
cart_fit=rpart(will_go_to_college~., data=train_data, method="class",cp=0.008)
rpart.plot(cart_fit)
testPred<- predict(cart_fit,newdata=test_data,type="class")
results<- confusionMatrix(testPred,test_data$will_go_to_college,positive="1")
print(results)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 85 23
1 15 75
Accuracy : 0.8081
95% CI : (0.7462, 0.8605)
No Information Rate : 0.5051
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6158
Mcnemar's Test P-Value : 0.2561
Sensitivity : 0.7653
Specificity : 0.8500
Pos Pred Value : 0.8333
Neg Pred Value : 0.7870
Prevalence : 0.4949
Detection Rate : 0.3788
Detection Prevalence : 0.4545
Balanced Accuracy : 0.8077
'Positive' Class : 1
this matrix below show the evaluation model on testing data
| Evaluation method | Method |
|---|---|
| Accuracy | 80.81% |
| Error rate | 19.19% |
| Sensitivity | 76.53% |
| Specificity | 85.00% |
| Precision | 83.33% |
for the first time house area was chosen as the root due to its high purity, as indicated by the high Gini index.
The model relies on the parent_salary attribute to split the tree multiple times, indicating its importance in determining the classification. This suggests that the parent’s salary has a significant impact on whether a student will go to college or not, and it is repeatedly used as a criterion for further branching in the tree.
It’s worth noting that 20% of the dataset follows the rule:
IF house_area < 0.54 and parent_salary < 0.42, THEN will go to college = 1.
This rule implies that when the house area is below a certain threshold and the parent’s salary is below another threshold, there is a higher likelihood of the student going to college.
Moreover, 28% of the dataset follows the rule:
IF house_area >= 0.54 and parent_salary >= 0.37 and type_school = 1, THEN will go to college = 1. This rule suggests that when the house area is above a certain threshold, the parent’s salary is above another threshold, and the type of school is 1, there is a higher probability of the student going to college.
Overall, these rules cover a significant portion of the dataset, 48% of it. This indicates that a large amount of data can be classified based on these specific rules, providing valuable insights into the factors that influence a student’s likelihood of attending college. we only mentioned those two rules duo to their high percentage of data
| 50% training, 50% testing |
|---|
set.seed(123)
ind=sample(2, nrow(data), replace=TRUE, prob=c(0.50 , 0.50))
train_data=data[ind==1,]
test_data=data[ind==2,]
dim(train_data)
[1] 493 11
dim(test_data)
[1] 507 11
We partition the data the data into (50% training, 50% testing). This result in 493 object in the training set and 507 object in the testing set.
library(party)
myFormula<- will_go_to_college~ + type_school+school_accreditation+gender+interest+residence+parent_age+parent_salary+house_area+average_grades+parent_was_in_college
dataset_ctree<-ctree(myFormula, data=train_data)
table(predict(dataset_ctree), train_data$will_go_to_college)
0 1
0 211 23
1 40 219
plot(dataset_ctree,type="simple")
testPred <- predict(dataset_ctree, newdata = test_data)
result<-table(testPred, test_data$will_go_to_college )
library(caret)
co_result <- confusionMatrix(result, positive = "1")
print(co_result)
Confusion Matrix and Statistics
testPred 0 1
0 196 27
1 53 231
Accuracy : 0.8422
95% CI : (0.8075, 0.8729)
No Information Rate : 0.5089
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6837
Mcnemar's Test P-Value : 0.005189
Sensitivity : 0.8953
Specificity : 0.7871
Pos Pred Value : 0.8134
Neg Pred Value : 0.8789
Prevalence : 0.5089
Detection Rate : 0.4556
Detection Prevalence : 0.5602
Balanced Accuracy : 0.8412
'Positive' Class : 1
this matrix below show the evaluation model on testing data
| Evaluation method | Method |
|---|---|
| Accuracy | 84.22% |
| Error rate | 15.78% |
| Sensitivity | 89.53% |
| Specificity | 78.71% |
| Precision | 81.34% |
In this decision tree model, the root attribute is determined by selecting the average grade with the highest information gain. This indicates that the average grade is considered the most informative feature for classifying the data, as it provides significant insights into the classification process.
To improve the model’s predictive capability, additional splitting criteria are incorporated based on their information gain values. These criteria include type_school, interest, residence, parent_salary, house_area, and whether the parent was in college. By considering these attributes, the model aims to capture more relevant information and enhance its ability to classify the data accurately.
In the tree structure, the model utilizes both the house area and average grades as splitting criteria in multiple levels. This means that the data is divided into different branches of the tree based on these attributes, allowing for more refined classification.
gender, parent age, and school accreditation are not included in the tree due to either limited availability of the data or their lower information gain values, indicating that they may have less impact on the classification process according to the model’s evaluation.
library(RWeka)
C45Fit <- J48( will_go_to_college~.,data=train_data)
table(train_data$will_go_to_college, predict(C45Fit))
0 1
0 234 17
1 13 229
plot(C45Fit,type = "simple")
testPred <- predict(C45Fit, newdata = test_data)
results <- confusionMatrix(testPred, test_data$will_go_to_college, positive = "1")
print(results)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 206 32
1 43 226
Accuracy : 0.8521
95% CI : (0.8181, 0.8818)
No Information Rate : 0.5089
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7038
Mcnemar's Test P-Value : 0.2482
Sensitivity : 0.8760
Specificity : 0.8273
Pos Pred Value : 0.8401
Neg Pred Value : 0.8655
Prevalence : 0.5089
Detection Rate : 0.4458
Detection Prevalence : 0.5306
Balanced Accuracy : 0.8516
'Positive' Class : 1
this matrix below show the evaluation model on testing data
| Evaluation method | Method |
|---|---|
| Accuracy | 85.21% |
| Error rate | 14.79% |
| Sensitivity | 87.60% |
| Specificity | 82.73% |
| Precision | 84.01% |
In contrast to the previous decision tree, the current tree is more complex and includes all the attributes except for parent age. Notably, for the first time, gender is included as an attribute in the tree. This indicates that the decision tree now takes into account a wider range of factors, including gender, to make its classifications. By incorporating these additional attributes, the model aims to capture more nuanced patterns and relationships in the data.
The attribute with the highest Gain ratio is chosen as the root attribute. In this particular case, Parent salary is determined to have the highest Gain ratio and is selected as the root attribute. This suggests that Parent salary provides the most significant reduction in uncertainty and is considered crucial for classifying the data accurately.
library(rpart)
library(rpart.plot)
cart_fit=rpart(will_go_to_college~., data=train_data, method="class",cp=0.008)
rpart.plot(cart_fit)
testPred<- predict(cart_fit,newdata=test_data,type="class")
results<- confusionMatrix(testPred,test_data$will_go_to_college,positive="1")
print(results)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 204 30
1 45 228
Accuracy : 0.8521
95% CI : (0.8181, 0.8818)
No Information Rate : 0.5089
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7037
Mcnemar's Test P-Value : 0.106
Sensitivity : 0.8837
Specificity : 0.8193
Pos Pred Value : 0.8352
Neg Pred Value : 0.8718
Prevalence : 0.5089
Detection Rate : 0.4497
Detection Prevalence : 0.5385
Balanced Accuracy : 0.8515
'Positive' Class : 1
this matrix below show the evaluation model on testing data
| Evaluation method | Method |
|---|---|
| Accuracy | 85.21% |
| Error rate | 24.79% |
| Sensitivity | 88.37% |
| Specificity | 81.93% |
| Precision | 83.52% |
27% of the dataset follows the rule:
IF average_grades=+A,+B,A and parent_salary>=0.37 and house_area>=0.41 and type_school=1, THEN will_go_to_collage=0.
Furthermore, 17% of the dataset follows the rule:
IF average_grades=B,+C and house_area<=0.41, THEN will_go_to_collage=1.
These rules highlight the complexity and variability in determining whether a student will go to college. Even if a student has stable financial circumstances and good grades, it does not guarantee their enrollment in college. in contract to the second which mean Other factors could play a role in the decision-making process.
we only mentioned those two rules duo to their high percentage of data
Information Gain is a measure used in the ID3 algorithm for attribute selection in decision tree learning. It quantifies the amount of information obtained about the class variable when a given attribute is used to split the data. The Information Gain is calculated by substract the entropy Expected information needed to classify a tuple in D a with Conditional Entropy Information needed after using A to split D into v partitions, to classify D . A higher Information Gain indicates that the attribute is more informative and should be chosen as the splitting criterion.
for our trees,
average grade was chosen as the root attribute for all 3 trees with the highest information gain This attribute Reflects the least randomness or “impurity” in these partitions and guarantees that a simple tree is found.
| 50%-50% | 70%-30% | 80%-20% | |
|---|---|---|---|
| Accuracy | 84.22% | 82.37 % | 83.84 % |
| Error rate | 15.78% | 17.63% | 16.16% |
| Sensitivity | 89.53% | 75.89% | 86.73% |
| Specificity | 78.71% | 88.31% | 81.00% |
| Precision | 81.34% | 85.60% | 81.73% |
the evaluation method table shows that the best partiton for information gain shows that 50%-50% is the best model based on accuracy(tuples that are correctly classified) with precntages 84.22% and Sensitivity (True positive recognition rate) with 89.53% but the lowest Specificity(True negative recognition rate) with 78.71%. overall it’s the best model for information gain has the lowest error rate 15.78%
Gain Ratio is an normaliztion for Information Gain and is used in the C4.5 algorithm, which is an extension of ID3. While Information Gain tends to favor attributes with many multivariate It is calculated by dividing the Information Gain by SplitInfo a(D).
for our trees,
Parent salary was chosen as the root attribute for all 3 trees due to its high Gain ratio, the trees split are unbalanced since gain ratio Tends to prefer unbalanced splits in which one partition is much smaller than the others
| 50%-50% | 70%-30% | 80%-20% | |
|---|---|---|---|
| Accuracy | 85.21% | 80.34% | 80.03% |
| Error rate | 14.79% | 19.66% | 19.97% |
| Sensitivity | 87.60% | 77.30% | 84.69% |
| Specificity | 82.73% | 83.12% | 76.00% |
| Precision | 84.01% | 80.74% | 77.57% |
the evaluation method table shows that the best partition for gain ratio shows that 50%-50% is the best model based on accuracy(tuples that are correctly classified) with percntages 85.21% and Sensitivity (True positive recognition rate) with 87.60% and Precision(tuples labeled as positive are actually positive) with percentages 84.01% but the lowest Specificity(True negative recognition rate) with 82.73%. overall it’s the best model dor gain ratio with the lowest error rate 14.79%
Gini Index is a criterion used in the CART (Classification and Regression Trees) algorithm for attribute selection. It measures the impurity of a set of tuples, with lower values Gini a(D) indicating higher purity. The Gini Index is calculated by summing the squared probabilities of each class label in the set
for our trees,
gini index had different root (parent salary/house area/averages grades) for each tree
| 50%-50% | 70%-30% | 80%-20% | |
|---|---|---|---|
| Accuracy | 85.21% | 79.66% | 80.81% |
| Error rate | 14.79% | 20.34% | 19.19% |
| Sensitivity | 88.37% | 78.01% | 76.53% |
| Specificity | 81.93% | 81.17% | 85.00% |
| Precision | 83.52% | 79.14% | 83.33% |
the evaluation method table shows that the best partition for gain ratio shows that 50%-50% is the best model based on accuracy(tuples that are correctly classified) with percentages 85.21% and Sensitivity (True positive recognition rate) with 88.37% and Precision(tuples labeled as positive are actually positive) with percentages 83.52% overall it’s the best model for gini index with the lowest error rate 14.79%
noticeable notes
important to note that gender was not included in 7 of 9 trees which
parent’s salary was the root of 4 of 9 trees
average grades was the root of 4 of 9 trees
house area was the root only once 1 of 9 tree
In summary, the analysis of the decision trees reveals that financial circumstances, as indicated by parent’s salary, and academic performance, as indicated by average grades, are crucial factors for contemporary universities when predicting college enrollment. In this particular dataset, gender is not considered highly important in the decision-making process.
the 50% training, 50% testing was the best partition for all attribute selection measures with accuracies 85.21%. for both gini index and gain ratio
Both attribute selection measures yield the same accuracy and error rate, so we need to check the remaining evaluation metrics that is provided:
| gain ratio | gini index | |
|---|---|---|
| Accuracy | 85.21% | 85.21% |
| Error rate | 14.79% | 14.79% |
| Sensitivity | 87.60% | 88.37% |
| Specificity | 82.73% | 81.93% |
| Precision | 84.01% | 83.52% |
Comparing the sensitivity, specificity, and precision:
- Sensitivity: Gini Index has a higher sensitivity value (88.37%) compared to Gain Ratio (87.60%). Higher sensitivity indicates a better ability to correctly identify positive instances so gini index is better regarding Sensitivity.
- Specificity: Gini Index has a lower specificity value (81.93%) compared to Gain Ratio (82.73%). Higher specificity indicates a better ability to correctly identify negative instances.
- Precision: Gain Ratio has a higher precision value (84.01%) compared to Gini Index (83.52%). Higher precision indicates a better ability to correctly classify positive instances.
so gain ratio is better regarding Specificity and Precision .
Based on these comparisons, it is difficult to determine definitively which attribute selection measure is the best. The choice between Gini Index and Gain Ratio may depend on the specific requirements and priorities of the problem but since we focus on the true negative more than true positive because our goal of classification to determine who will not go to collage, so we will choose the model based on the Specificity and Precision (Gain ratio)
our final tree is:
In this analysis, we apply K-means clustering to the dataset using different values of K. K-means clustering is an unsupervised learning algorithm that partitions the data into K clusters based on similarity. We will explore three different values of K and evaluate the clustering results using various metrics.
original_data <- Preprocessed_dataset
# Remove any non-numeric attributes
numeric_data <- original_data[, sapply(original_data, is.numeric)]
# Remove the class label 'will_go_to_college'
numeric_data <- numeric_data[, !(names(numeric_data) == 'will_go_to_college')]
Now, the ‘numeric_dataset’ dataset contains only numeric attributes without the class label, which makes it ready for the clustering process.
___________________________________________________________________________________
In our exploration of k-means clustering, we will start by examining the dataset with K=2, representing an initial attempt to partition the data into two distinct clusters. This initial analysis will serve as a foundational step in understanding the inherent structure and patterns
# k-means clustering set a seed for random number generation to make the results reproducible
set.seed(8953)
# run kmeans clustering to find 2 clusters
kmeans.result <- kmeans(numeric_data, 2)
# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)
NA
NA
NA
#average for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data))
#k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
NA
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")
Total Within-Cluster Sum of Squares: 8669.059
true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)
cluster_assignments <- kmeans.result$cluster
# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
cluster_indices <- which(true_labels == i)
precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))
# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
cluster_indices <- which(cluster_assignments == j)
recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.002333333
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.1166667
Silhouette Width: 0.36
Within-Cluster Sum of Squares: 8669.059
BCubed Precision: 0.0023
BCubed Recall: 0.1167
So, the clusters are moderately compact with a reasonable silhouette width, but low precision and recall suggest potential misclassification issues.
___________________________________________________________________________________
Moving to K=5, our exploration in k-means clustering advances to a configuration with five clusters. This adjustment allows for a more detailed partitioning of the dataset, potentially revealing finer grained patterns and distinctions among the data points. The analysis at K=5 will offer insights into the nature of these clusters, assessing their characteristics, compactness, and separation, as we strive to optimize the clustering configuration for the specific nuances of our dataset.
# k-means clustering set a seed for random number generation to make the results reproducible
set.seed(8953)
# run kmeans clustering to find 4 clusters
kmeans.result <- kmeans(numeric_data, 5)
# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)
NA
NA
NA
#average for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data))
#k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
NA
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")
Total Within-Cluster Sum of Squares: 4303.841
true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)
cluster_assignments <- kmeans.result$cluster
# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
cluster_indices <- which(true_labels == i)
precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))
# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
cluster_indices <- which(cluster_assignments == j)
recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.001666667
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.1333333
Silhouette Width: 0.29
Within-Cluster Sum of Squares: 4303.841
BCubed Precision: 0.00166
BCubed Recall: 0.1333
Final Analysis: Clusters are more compact than k=2 but less well-separated. Precision and recall remain low, indicating potential misclassifications.
___________________________________________________________________________________
Proceeding to k=7, our investigation in k-means clustering expands further as we explore a configuration with seven clusters. This adjustment aims to capture even more nuanced patterns and variations within the dataset. Analyzing the characteristics of the seven clusters will provide a more granular understanding of the underlying structure, potentially revealing subtleties that might not be as evident with fewer clusters. As we delve into k=7, our objective is to refine our clustering configuration to align more closely with the intricate details present in the dataset.
# k-means clustering set a seed for random number generation to make the results reproducible
set.seed(8953)
# run kmeans clustering to find 2 clusters
kmeans.result <- kmeans(numeric_data, 7)
# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)
NA
NA
NA
NA
#average for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data))
#k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
NA
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")
Total Within-Cluster Sum of Squares: 3389.656
true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)
cluster_assignments <- kmeans.result$cluster
# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
cluster_indices <- which(true_labels == i)
precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))
# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
cluster_indices <- which(cluster_assignments == j)
recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.0006666667
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.06666667
Silhouette Width: 0.27
Within-Cluster Sum of Squares: 3389.656
BCubed Precision: 0.00067
BCubed Recall: 0.067
Final Analysis: Clusters are less well-separated than k=2 and k=5 and while more compact than k=5, precision and recall remain low, indicating potential misclassifications.
___________________________________________________________________________________
We implemented the Elbow Method for determining the optimal number of clusters (k) in a k-means clustering algorithm.
# Function to calculate total within-cluster sum of squares (wss)
wss <- function(k) {
kmeans_result <- kmeans(numeric_data, centers = k, nstart = 10) # You can adjust nstart based on your preference
return(sum(kmeans_result$tot.withinss))
}
# Calculate the total within-cluster sum of squares for different values of k
k_values <- 1:10 # You can adjust the range of k values
wss_values <- sapply(k_values, wss)
# Plot the elbow curve
plot(k_values, wss_values, type = "b", pch = 19, frame = FALSE,
xlab = "Number of Clusters (k)", ylab = "Total Within-Cluster Sum of Squares (WSS)",
main = "Elbow Method")
NA
NA
The elbow method has indicated that K=5 is a reasonable choice in terms of balancing the trade-off between capturing variance and not overly complicating the model with too many clusters.
| 50%-50% | 70%-30% | 80%-20% | |||||||
|---|---|---|---|---|---|---|---|---|---|
| ig | igr | gini | ig | igr | gini | ig | igr | gini | |
| Accuracy | 84.22% | 85.21% | 85.21% | 82.37 % | 80.34% | 79.66% | 83.84 % | 80.03% | 80.81% |
| Error rate | 15.78% | 14.79% | 14.79% | 17.63% | 19.66% | 20.34% | 16.16% | 19.97% | 19.19% |
| Sensitivity | 89.53% | 87.60% | 88.37% | 75.89% | 77.30% | 78.01% | 86.73% | 84.69% | 76.53% |
| Specificity | 78.71% | 82.73% | 81.93% | 88.31% | 83.12% | 81.17% | 81.00% | 76.00% | 85.00% |
| Precision | 81.34% | 84.01% | 83.52% | 85.60% | 80.74% | 79.14% | 81.73% | 77.57% | 83.33% |
Accuracy (recognition rate): Percentage of test set tuples that are correctly classified
Error rate (misclassification rate): 1 – accuracy
Sensitivity (recall): True positive recognition rate
Specificity: True negative recognition rate
Precision (exactness): What % of tuples labeled as positive are actually positive.
For partition 50%-50% : gain ratio and gini index equals in accuracy 85.21% but as we mention before in best model we choose based on Precision 84.01% and Specificity 82.73% , Gain rato is the best model.
For partition 70%-30%: information gain is the best model for this partition with accuracy 82.37 %
For partition 80%-20%: information gain is the best model for this partition with accuracy 83.84 %
overall 50%-50% wasa the best partition with algorithm gini index with accuracy 85.21%.
| Metric | K=2 | K=5 | K=7 |
|---|---|---|---|
| Average Silhouette Width | 0.36 | 0.29 | 0.27 |
| Total Within-Cluster Sum of Squares | 8669.059 | 4303.841 | 3389.656 |
| BCubed Precision | 0.002333333 | 0.001666667 | 0.00067 |
| BCubed Recall | 0.1166667 | 0.1333333 | 0.067 |
In our overall analysis, we evaluated the performance of the k-means clustering algorithm for three different values of k (2, 5, and 7). The key metrics provide insights into the characteristics of the resulting clusters:
Average Silhouette Width:
Total Within-Cluster Sum of Squares:
BCubed Precision and Recall:
Considering the results obtained from the elbow method, which identifies k=5 as the optimal number of clusters, our analysis aligns with this recommendation. The detailed examination of silhouette width, within-cluster sum of squares, and BCubed metrics reinforces the choice of k=5 as a suitable configuration, striking a balance between cluster distinctiveness and internal cohesion. This comprehensive evaluation supports the decision to proceed with k=5 for a more refined and effective clustering solution.
In the following figures, the visualisation of the clusters for each trial:
K=2:
k=5
K=7
classification offers a more interpretable and actionable solution compared to clustering, as it assigns clear labels to data points. In addition, the availability of class labels enhances the model’s ability to generalize and make informed predictions. While clustering struggle with overlapping clusters and lack of clear evaluation metrics, classification leverages the labeled data to optimize performance and provide meaningful insights. Therefore, for our dataset, the utilization of classification is not only justified by its high accuracy but also by its ability to address the specific characteristics and challenges presented in the data.